Posters - Schedules
Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7
minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19
and no later than July 23. All registered conference participants will have access to the posters and presentations
throughout the conference and until October 31, 2021. Q&A is available through a chat
function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.
Information on preparing your poster and poster talk is available at:
https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters
Ideally authors should be available for interactive chat during the times noted below:
Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC
Short Abstract: One of the most crucial steps in clinical genetics pipelines is variant annotation and prioritization, in which we attempt to enrich an individual’s variation with information from public genomic databases. Despite the plethora of available methods for information extraction from biomedical text, they rarely take part in the annotation/prioritization step of typical NGS pipelines. This is because existing methods are not suited for mass querying of the complete genome variation of an individual. Here we present VCF2PHEN, an open tool that builds a graph from the BioC corpus, which comprises all open and extensively pre-annotated PubMed articles, in less than 10 hours. In this graph, nodes represent Articles (n=19M), Chemicals (n=350K), Diseases (n=11K), Genes (n=37K), Mutations (n=422K) and Transcripts (n=127K), interconnected through 106 million edges. All mutations have been homogenized and validated through VariantValidator. The graph can be queried and explored through the Cypher language, which is served and visualized through the Neo4j graph database engine. Through this engine we can query the entirety of variants (~100K) identified in NGS experiments in a practical timescale. The result of this query is a personalized graph containing all existing bibliographic evidence linking the individual’s genetic profile with known diseases and chemical/drug interactions.
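As an illustration of how a full set of variants might be sent to such a graph in bulk, here is a minimal Python sketch that splits a variant list into batches and pairs each batch with a parameterized Cypher query. The node labels (`Mutation`, `Article`, `Disease`), the `MENTIONS` relationship, the `hgvs` and `name` properties, and the helper names are all assumptions made for this example; the abstract does not describe the actual VCF2PHEN schema.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical schema: labels, relationship type and property names are
# assumptions for illustration, not the real VCF2PHEN graph model.
CYPHER = (
    "UNWIND $hgvs AS h "
    "MATCH (m:Mutation {hgvs: h})<-[:MENTIONS]-(a:Article)-[:MENTIONS]->(d:Disease) "
    "RETURN m.hgvs AS mutation, d.name AS disease, count(a) AS articles"
)


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_queries(variants: List[str], batch_size: int = 10_000) -> List[Tuple[str, Dict[str, List[str]]]]:
    """Pair the Cypher text with one parameter dict per batch of variants."""
    return [(CYPHER, {"hgvs": chunk}) for chunk in batches(variants, batch_size)]


if __name__ == "__main__":
    toy_variants = [f"variant_{i}" for i in range(25)]  # placeholder identifiers
    for query, params in build_queries(toy_variants, batch_size=10):
        print(len(params["hgvs"]), "variants in this batch")
```

Each `(query, params)` pair could then be passed to a Neo4j client, for example through the official Python driver's `session.run(query, params)`. Passing the variants as a query parameter rather than formatting them into the query string keeps the query text constant across batches.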
Short Abstract: Metagenomics is a culture-independent approach for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally and/or taxonomically), either from a longitudinal study or between independent studies, can provide clues into how the microbiota have adapted to a particular environment. However, to understand the impact of environmental factors on the microbiome, it is important to also account for experimental confounding factors. Metagenomics databases, such as MGnify, provide analytical services that enable consistent functional and taxonomic annotation to mitigate bioinformatic confounding factors. However, a recurring challenge is that key metadata about the sample (e.g. location, pH) and the molecular methods used to extract and sequence the genetic material are often missing from the sequence records. Nevertheless, this missing metadata may be found in publications describing the research. When identified, the additional metadata can lead to a substantial increase in data reuse and greater confidence in the interpretation of observed biological trends. Here, we describe a machine learning framework that automatically extracts relevant metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework includes 3 processes: (1) literature classification and triage, (2) named entity recognition (NER) and (3) database enrichment.
Short Abstract: We present an analysis of supplementary materials of PubMed Central (PMC) articles and show their importance in indexing and searching biomedical literature, in particular for the emerging genomic medicine field. On a subset of articles from PubMed Central, we use text mining methods to extract MeSH terms from abstracts and from text-based supplementary materials, such as spreadsheets and doc(x) files. We find that the recall of MeSH annotations increases by about 5.9 percentage points (+20% in relative terms) when supplementary materials are considered in addition to abstracts. We further compare the supplementary material annotations with annotations found in the article's full text and find that the recall of MeSH terms increases by 1.5 percentage points (+3% in relative terms). Additionally, we analyze genetic variant mentions in abstracts and full texts and compare them with mentions found in text-based files in the supplementary materials. We find that the majority of variants (about 99%) are found in text-based files of supplementary materials. Our study also highlights which types of information appear in spreadsheets that are often missing in abstracts. In conclusion, we suggest that supplementary data should receive more attention from the information retrieval community, in particular in life and health sciences.
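The difference between a gain in percentage points and a relative gain can be made concrete with a small Python sketch. It computes recall as the fraction of reference terms that were retrieved, then reports both kinds of gain. The term sets are toy data invented for the example and the function names are hypothetical; none of this comes from the study.

```python
from typing import Set, Tuple


def recall(found: Set[str], relevant: Set[str]) -> float:
    """Fraction of the relevant (reference) terms that were found."""
    return len(found & relevant) / len(relevant)


def recall_gain(before: float, after: float) -> Tuple[float, float]:
    """Return (gain in percentage points, gain relative to `before` in percent)."""
    points = (after - before) * 100
    relative = (after - before) / before * 100
    return points, relative


if __name__ == "__main__":
    # Toy data: ten reference terms, three found in the abstract,
    # one more found only in the supplementary files.
    reference = set("abcdefghij")
    from_abstract = {"a", "b", "c", "x"}
    from_supplement = {"d", "y"}

    r_abstract = recall(from_abstract, reference)
    r_combined = recall(from_abstract | from_supplement, reference)
    points, relative = recall_gain(r_abstract, r_combined)
    print(f"recall {r_abstract:.2f} -> {r_combined:.2f}: "
          f"+{points:.1f} points, +{relative:.1f}% relative")
```

By the same arithmetic, a gain of 5.9 percentage points that corresponds to +20% in relative terms would imply an abstract-only recall of roughly 5.9 / 0.20 = 29.5%. That baseline is inferred here from the two reported figures and is not stated in the abstract.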
Short Abstract: Single-cell RNA-sequencing allows us to measure gene expression levels in thousands of individual cells from a heterogeneous tissue sample simultaneously, but assigning a cell-type label to each single-cell transcriptome after sequencing is challenging. Researchers often use known cell-type marker genes to make these assignments, but curating lists of marker genes from the scientific literature is time consuming, and inconsistent marker gene lists from different research groups hinder reproducibility. We hypothesize that natural language processing (NLP) can be used to identify useful markers for thousands of cell types in an unbiased manner. To test this hypothesis, we leveraged millions of PubMed abstracts to generate numerical vector representations of ~15k ENSEMBL genes and each of the thousands of cell types described in the Cell Ontology. We then used supervised and unsupervised methods to predict the relationships between genes and cell-types, giving us a score for each gene/cell-type pair. To ensure the scores were cell-type-specific, they were normalized among groups of related cell types. We found the top ranked normalized NLP markers outperformed hand-curated markers when identifying PBMC cell types, providing a proof of principle that NLP approaches can create unbiased lists of cell-type-specific marker genes useful for annotating single-cell RNA-seq data.
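One way to picture the scoring and normalization steps is the Python sketch below. It scores a gene against each cell type by cosine similarity between their vectors, then converts the raw scores to z-scores within a group of related cell types. Both choices are assumptions made for illustration: the abstract does not say which similarity measure or normalization the authors used, and the two-dimensional vectors are toy values.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalized_scores(gene_vec: List[float],
                      group: Dict[str, List[float]]) -> Dict[str, float]:
    """Z-score the gene's similarity to each cell type within one related group."""
    raw = {name: cosine(gene_vec, vec) for name, vec in group.items()}
    values = list(raw.values())
    mu, sd = mean(values), pstdev(values)
    # If every cell type in the group scores the same, no type stands out.
    return {name: (s - mu) / sd if sd else 0.0 for name, s in raw.items()}


if __name__ == "__main__":
    gene = [1.0, 0.0]                  # toy embedding of one gene
    lymphocytes = {                    # toy embeddings of related cell types
        "T cell": [1.0, 0.0],
        "B cell": [0.0, 1.0],
        "NK cell": [1.0, 1.0],
    }
    for name, score in normalized_scores(gene, lymphocytes).items():
        print(f"{name}: {score:+.2f}")
```

Normalizing within a group means a gene that is similar to every cell type in the group receives scores near zero for all of them, while a gene that is close to only one type stands out for that type.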
Short Abstract: Medical research can benefit from information in electronic health records, but, as they often exist as unstructured free text, processing with machine learning tools is challenging. Transformer-based models like BERT represent a promising approach to tackle this issue, as they achieve state-of-the-art results in many domains; however, applications in the biomedical context focus mainly on the English language. While German language models such as GermanBERT and gottBERT are available, domain-specific models for biomedical data are yet to be developed.
In this study, we critically assessed the suitability of existing and new models for the biomedical domain. We used five German language models, pre-trained a new model on a newly-assembled biomedical corpus, and compared them with each other. For the evaluation, we annotated a new dataset of clinical documents and used it alongside two other corpora (GGPONC and JSynCC) for named-entity recognition and sequence classification.
Despite the small corpus available for pre-training, the domain-specific model achieved better predictive performance than an existing rule-based system. However, general-purpose German language models were not outperformed by domain-specific ones, which suggests that such models are a reasonable starting point for the German-speaking region. Domain-specific models might perform better if larger corpora for pre-training were available.
Short Abstract: The curation of biomedical texts is an essential task in the biomedical field. Automating the curation process can go a long way in improving the accuracy of annotations that are deposited in widely used resources.
We propose OntoSearch, a new method for annotating abstracts and other biomedical texts. OntoSearch exploits the controlled vocabularies of various ontologies to identify terms and events in the literature, which are then used to create a searchable Semantic Network.
OntoSearch also uses a number of Natural Language Processing (NLP) rules to identify vocabulary terms in the abstracts, which are marked as entities in the annotated output. The relationships are stored in a graph structure, which allows OntoSearch to deduce complex interactions.
OntoSearch is evaluated against the abstracts in BioCreative. The first prototype achieved an F-score of 35%. For comparison, the average F-score of the top performing methods is 45%, and the F-scores of the competing methods range from 32% to 57%.
We plan to improve the capability of OntoSearch by refining the NLP rules to capture genes and gene products better. We also plan to use Deep Learning approaches to improve the annotation capability.
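For readers unfamiliar with the metric reported above, the following Python sketch computes an F-score from counts of true positives, false positives and false negatives. It assumes the balanced F1 variant (`beta = 1`), which the abstract does not specify, and the counts in the usage example are invented so that the result happens to equal 0.35; they are not the evaluation data.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from annotation counts; beta = 1 gives the balanced F1."""
    if tp == 0:
        return 0.0  # no correct annotations: precision and recall are both zero
    precision = tp / (tp + fp)  # share of predicted annotations that are correct
    recall = tp / (tp + fn)     # share of reference annotations that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Illustrative counts only: 35 correct annotations, 65 spurious, 65 missed.
    print(f"F1 = {f_score(tp=35, fp=65, fn=65):.2f}")
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, so an annotator needs both to be reasonably high to score well.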
Short Abstract: Many cell type annotation methods use labeled reference cell atlases or manually-curated lists of marker genes to infer which cell type a new cluster represents, but these methods can introduce study and selection bias based on what the reference atlases or marker gene lists include. In addition, these methods are often incapable of annotating cells to cell types not found in the reference atlas or the list of marker genes. Here, we propose an approach that uses natural language processing of millions of PubMed abstracts to associate potential marker genes with all of the thousands of cell types described in the Uberon Ontology automatically and without bias. First, we create numerical representations of genes and cell types by embedding them in a shared high-dimensional space based on the text of over 17 million biomedical abstracts in PubMed and the curated hierarchical relationships between cell types in the Uberon Ontology. We then train a deep neural network to associate gene embeddings with ontology-referenced cell types. Our cross-validation results show that our method can extend known marker gene lists to encompass novel gene/cell type relationships, even when the gene and/or cell type has not been previously studied in this context.